DETAILED ACTION
Continued Examination Under 37 CFR 1.114

1. A request for continued examination under 37 CFR 1.114, including the fee set forth in
37 CFR 1.17(e), was filed in this application after final rejection. Since this application is
eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e)
has been timely paid, the finality of the previous Office action has been withdrawn pursuant to
37 CFR 1.114. Applicant's submission filed on 9/25/2007 has been entered.

Claim Rejections - 35 USC § 103 

1. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

2. Claims 22-25, 27, 29-32, and 34 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Ezzat et al. (NPL document, "Visual Speech Synthesis by Morphing 
Visemes", herein referred to as "Ezzat") in view of Jiang et al. (NPL document, "Visual Speech 
Analysis with Application to Mandarin Speech Training", herein referred to as "Jiang") in view 
of Cox et al. (NPL Doc, "Speech and language processing for next-millennium communications 
services"). 

As per claims 22, 23, and 30, Ezzat teaches the claimed "selecting" step at the top of the 1st
column on pg. 51 and states:
"there are many intermediate frames that lie between the chosen viseme images ...
Consequently, we compute a series of consecutive optical flow vectors between each
intermediate image and its successor, and concatenate them all into one large flow vector
that defines the global transformation between the chosen visemes". (emphasis added)

And states in the abstract: 

we are able to synchronize the visual speech stream with the audio speech stream, and 
hence give the impression of a photorealistic talking face.
(emphasis added) 

Here, the visemes represent a generic facial image that can be used to describe a particular
sound, and the flow vectors, which contain visual and sound features, are used in conjunction with
the visemes.

Ezzat does not explicitly teach the claimed "obtaining" step. Jiang teaches the claimed
"obtaining" step by stating in the abstract:

At each frame, region of interest is identified and 
key information is extracted. The preprocessed acoustic 
and visual information are then fed into a modular TDNN 
and combined for visual speech analysis. (emphasis added)

and states (pg. 114, 4.2 Acoustic and Visual Input Representation, 1st paragraph):

For acoustic data representation, we have followed 
the well-established approach to apply FFT on the Hamming 
windowed speech data to get 16 Melscale Fourier coefficients as 
input to the Acoustic input Layer. For visual data representation, 
we have performed the lip-tracking and feature points extraction 
task by applying our 2D multi-state lip shape model. Then we 
use both the color profile of the feature points on external and 
internal boundaries and position and movement of lip boundaries 
for feature extraction using principle component analysis (PCA). 
The extracted feature vectors are then fed to the Visual Input 
Layer. (emphasis added)

Here, Jiang teaches feature vectors (target feature vector) and teaches visual data
(visual features) and acoustic information (non-visual information). It would have been obvious
to one of ordinary skill in the art at the time of invention to combine Ezzat with Jiang. Jiang
teaches one advantage of obtaining feature vectors, namely helping children improve their speech
pronunciation (see section 5, pgs. 114-115, 1st paragraph) by providing audio-visual feedback.
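
For illustration only, the acoustic representation quoted from Jiang above (an FFT applied to Hamming-windowed speech data to obtain 16 Mel-scale coefficients) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Jiang: the triangular filterbank, the common 2595 * log10(1 + f/700) Mel formula, the frame length, and all function names are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (an assumption; Jiang does not state one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_coefficients(frame, sample_rate, n_filters=16):
    # Hamming window, FFT power spectrum, then Mel-scale filterbank energies.
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    return mel_filterbank(n_filters, frame.size, sample_rate) @ power

# Hypothetical input: one 256-sample frame of a 1 kHz tone sampled at 8 kHz.
frame = np.sin(2 * np.pi * 1000.0 * np.arange(256) / 8000.0)
coeffs = mel_coefficients(frame, sample_rate=8000)
```

The resulting 16-element vector corresponds to the kind of per-frame acoustic feature vector the quoted passage describes; the visual feature extraction (lip tracking and PCA) is not covered by this sketch.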

Ezzat does not explicitly teach the claimed "unit selection process". 
Cox teaches the claimed: 

Unit selection process in which a longest possible candidate image sample is selected (in figure
2 and at the bottom of the 2nd col on page 1318, "There are two good reasons why the method of unit
selection synthesis is capable of producing customer-quality or even natural-quality speech
synthesis. First, on-line selection of speech segments allows for longer units (whole words,
potentially even whole sentences) to be used in the synthesis if they are found in the inventory."
In addition, the reference teaches image samples, towards the middle of the 1st col on page
1319, "synthesized using photo-realistic two-dimensional image technologies (sample-based
VTTS)", where the unit selection process is used with these samples because there is a
correspondence between the audio and visual TTS).
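
For illustration only, the preference for longer units described in the passage quoted from Cox can be sketched as a greedy longest-match selection over an inventory. This is a simplified assumption rather than the method of Cox or of the claims: practical unit selection systems typically minimize a cost over many candidates, and the inventory contents and the function name below are hypothetical.

```python
def select_longest_units(target, inventory):
    """Cover `target` (a sequence of phoneme labels) with inventory units,
    always taking the longest unit available at the current position."""
    units = []
    i = 0
    while i < len(target):
        # Try the longest remaining span first, then progressively shorter ones.
        for length in range(len(target) - i, 0, -1):
            candidate = tuple(target[i:i + length])
            if candidate in inventory:
                units.append(candidate)
                i += length
                break
        else:
            raise ValueError(f"no unit in inventory covers {target[i]!r}")
    return units

# Hypothetical inventory of recorded units, keyed by their label sequences.
inventory = {("h",), ("e",), ("l",), ("o",), ("h", "e", "l"), ("l", "o")}
selected = select_longest_units(list("hello"), inventory)
```

In this example the two multi-label units are chosen in preference to five single-label units, which mirrors the quoted observation that longer units are used when they are found in the inventory.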

It would have been obvious to one of ordinary skill in the art at the time of invention to combine 
Ezzat, Jiang, and Cox. Cox teaches one advantage of the combination (bottom of 2nd col on
page 1318, "There are two good reasons why the method of unit-selection synthesis is capable of
producing customer-quality or even natural-quality speech synthesis"), where this can improve the
output of Ezzat.

As per claims 24-25 and 31-32, Ezzat teaches the claimed "selecting ... using a
comparison of a combination of visual features and non-visual features with the target feature
vector" by stating on pg. 47, 2nd col, 2nd paragraph:

For any input text, we determine the appropriate sequence of viseme morphs to make, 
as well as the rate of the transformations by utilizing the output of the natural language 
processing unit (emphasis added) 

In order to determine the appropriate sequence, the system would have to perform a comparison
of visual and non-visual features with a given target vector in order to produce the output as
stated. Further, this process of constructing an appropriate sequence of viseme morphs would
require selecting candidate image samples, where transitions between these samples could be
made through transformation.

Ezzat teaches the claimed compiling by teaching concatenation (see quote from top of
1st column on pg. 51 above).

As per claims 27 and 34, Ezzat teaches the claimed first database by teaching recording
and collecting one image per English phoneme (bottom of 1st column on pg. 47 under "Corpus
and Viseme Acquisition", also see figure 2).

Ezzat teaches the claimed second and third databases by teaching a "Flow database" (pg.
54, 2nd column), which contains optical flow vectors that specify transition data between
visemes (includes visual data and includes storing non-visual data, i.e. sound transitions).

As per claim 29, Ezzat teaches the claimed first database in figure 2, and the claimed second
database and the claimed third database on pg. 54, 2nd column under "Flow database", where this
database is formed to specify visual and non-visual data between animation transitions (frames).

3. Claims 28 and 35 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ezzat in
view of Jiang in further view of Cox in further view of Brand (NPL Document, "Voice
Puppetry", herein referred to as "Brand").

As per claims 28 and 35, Ezzat does not teach the claimed limitations. 

Brand teaches the claimed "selecting ... a number of candidates" and the claimed
"Viterbi search" by stating on the bottom half of the 1st col on pg. 25:

The Viterbi sequence, while most likely, may only represent a small fraction of the total 
probability mass — there may be thousands of slightly different state sequences that 
are nearly as likely. If this were to happen in the voice puppet, V would be a very poor 
representation of the relevant information 
in the audio, and the animation quality would suffer greatly. 

... These problems are virtually banished with entropically estimated models because
entropy minimization concentrates the probability mass on the optimal Viterbi
sequence. (emphasis added)

Brand teaches the claimed concatenation cost by stating on pg. 26, very bottom of 1st col
and very top of 2nd col:

We quantified this with a squared error measure of divergence between groundtruth (x) 
and reconstructed (y) facial motion vectors, weighted to penalize motions in the wrong 
direction. (emphasis added)

It would have been obvious to one of ordinary skill in the art at the time of invention to combine
Brand with the combinable system of Ezzat, Jiang, and Cox. Brand teaches the advantage of
using an optimal Viterbi sequence with a large number of state sequences (candidates) to reduce
the set to the most optimal ones in order to avoid poor animation quality (1st col on pg. 25, see
quote above).
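
For illustration only, a Viterbi search over a number of candidates using a concatenation cost can be sketched as the dynamic program below. This is a generic minimum-cost formulation and is not taken from Brand or from the claims: the per-candidate target costs, the `concat_cost` function, and the numeric values are hypothetical.

```python
def viterbi_min_cost(target_costs, concat_cost):
    """Find the lowest-cost sequence of candidates.

    target_costs[t][j] is the cost of candidate j at step t.
    concat_cost(t, i, j) is the cost of joining candidate i at step t-1
    to candidate j at step t.
    """
    best = list(target_costs[0])  # best total cost ending in each candidate
    back = []                     # back-pointers for each later step
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            i_min = min(range(len(best)), key=lambda i: best[i] + concat_cost(t, i, j))
            new_best.append(best[i_min] + concat_cost(t, i_min, j) + tc)
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total

# Hypothetical costs: two candidates per step, with a penalty for switching.
costs = [[1.0, 2.0], [2.0, 1.0], [1.0, 3.0]]
path, total = viterbi_min_cost(costs, lambda t, i, j: 0.0 if i == j else 1.5)
```

Here the search keeps candidate 0 at every step, because the switching penalty outweighs the locally cheaper candidate at the middle step.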

Response to Arguments 
Applicant's arguments with respect to the claims have been considered but are moot in view of 
the new ground(s) of rejection. 

Conclusion 

2. The prior art made of record and not relied upon is considered pertinent to applicant's 
disclosure: Breen et al. - US Patent 6,208,356. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Daniel F. Hajnik whose telephone number is (571) 272-7642. 
The examiner can normally be reached on Mon-Fri (8:30A-5:00P). 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Ulka J. Chauhan can be reached on (571) 272-7782. The fax phone number for the 
organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would 
like assistance from a USPTO Customer Service Representative or access to the automated 
information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 